HappyDB is a corpus of 100,000 crowd-sourced happy moments collected via Amazon’s Mechanical Turk. You can read more about it at https://arxiv.org/abs/1801.07746.
Here, we explore this data set and try to answer the question, “What makes people happy?”
From the packages’ descriptions:
tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures;
tidytext allows text mining using ‘dplyr’, ‘ggplot2’, and other tidy tools;
DT provides an R interface to the JavaScript library DataTables;
scales map data to aesthetics, and provide methods for automatically determining breaks and labels for axes and legends;
wordcloud2 provides an HTML5 interface to wordcloud for data visualization;
gridExtra contains miscellaneous functions for “grid” graphics;
ngram is for constructing n-grams (“tokenizing”), as well as generating new text based on the n-gram structure of a given text input (“babbling”);
Shiny is an R package that makes it easy to build interactive web apps straight from R.
devtools::install_github("lchiffon/wordcloud2")
library(tidyverse)
library(tidytext)
library(DT)
library(scales)
library(gridExtra)
library(ngram)
library(ggplot2)
library(wordcloud2)
library(tidyr)
library(reshape2)
library(tm)
library(topicmodels)
We use the processed data for our analysis and combine it with the demographic information available.
hm_data <- read_csv("../output/processed_moments.csv")
urlfile <- 'https://raw.githubusercontent.com/rit-public/HappyDB/master/happydb/data/demographic.csv'
demo_data <- read_csv(urlfile)
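Because the two tables are combined on the worker ID, it can help to check first that each worker appears only once in the demographic table and to see how many moments have no matching worker. The sketch below uses small made-up tibbles (`moments` and `workers` are hypothetical names) rather than the real files; only the `wid` column name is taken from the data.

```r
library(dplyr)

# Hypothetical toy data standing in for processed_moments.csv and demographic.csv
moments <- tibble(
  wid  = c(1, 1, 2, 3),
  text = c("walk dog", "good coffee", "job offer", "beach trip")
)
workers <- tibble(
  wid    = c(1, 2),
  gender = c("m", "f")
)

# Worker IDs that occur more than once would duplicate moments in a join
duplicate_ids <- workers %>%
  count(wid) %>%
  filter(n > 1)

# Moments whose worker ID has no demographic record (an inner join drops these)
unmatched <- anti_join(moments, workers, by = "wid")

nrow(duplicate_ids)
nrow(unmatched)
```

On the real data, the same two checks would be run with `hm_data` and `demo_data` in place of the toy tibbles.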
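We join the moments with the demographic data on the worker ID (wid), keep the columns of interest, add a per-moment word count, and retain only the rows whose gender, marital status, parenthood status, and reflection period take one of the values listed below.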
hm_data <- hm_data %>%
  inner_join(demo_data, by = "wid") %>%
  select(wid,
         original_hm,
         cleaned_hm,
         gender,
         marital,
         parenthood,
         reflection_period,
         age,
         country,
         ground_truth_category,
         text) %>%
  # Count words on the joined rows; hm_data$text would point at the pre-join table
  mutate(count = sapply(text, wordcount, USE.NAMES = FALSE)) %>%
  filter(gender %in% c("m", "f")) %>%
  filter(marital %in% c("single", "married")) %>%
  filter(parenthood %in% c("n", "y")) %>%
  filter(reflection_period %in% c("24h", "3m")) %>%
  # Change factor levels by hand
  mutate(reflection_period = fct_recode(reflection_period,
                                        months_3 = "3m", hours_24 = "24h"))
head(hm_data)
## # A tibble: 6 x 12
## wid original_hm cleaned_hm gender marital parenthood reflection_peri…
## <int> <chr> <chr> <chr> <chr> <chr> <fct>
## 1 2053 I went on a… I went on… m single n hours_24
## 2 2 I was happy… I was hap… m married y hours_24
## 3 1936 I went to t… I went to… f married y hours_24
## 4 206 We had a se… We had a … f married n hours_24
## 5 45 I meditated… I meditat… m single n hours_24
## 6 195 I made a ne… I made a … m single n hours_24
## # ... with 5 more variables: age <chr>, country <chr>,
## # ground_truth_category <chr>, text <chr>, count <int>
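To make the effect of each step easier to see, here is a minimal sketch of the same join, count, filter, and recode sequence on made-up data. The tibbles `moments` and `workers` are hypothetical, and the word count uses a base-R split on whitespace as a stand-in for `ngram::wordcount`, so the numbers may differ from what that function would report on messier text.

```r
library(dplyr)
library(forcats)

# Hypothetical toy data; column names mirror the ones used in the analysis
moments <- tibble(
  wid  = c(1, 2, 3, 4),
  text = c("walk dog park", "good coffee", "job offer", "beach trip friend")
)
workers <- tibble(
  wid               = c(1, 2, 3),
  gender            = c("m", "f", "o"),
  reflection_period = c("24h", "3m", "24h")
)

toy <- moments %>%
  # wid 4 has no demographic record, so the inner join drops it
  inner_join(workers, by = "wid") %>%
  # Bare column name: the count is computed on the joined rows
  mutate(count = lengths(strsplit(text, "\\s+"))) %>%
  # wid 3 is removed because its gender is outside the kept values
  filter(gender %in% c("m", "f")) %>%
  mutate(reflection_period = fct_recode(reflection_period,
                                        months_3 = "3m", hours_24 = "24h"))

toy
```

Referring to `text` by its bare name inside `mutate()` keeps the word counts aligned with the rows that survive the join.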
datatable(hm_data)
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html
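The warning appears because the whole table is sent to the browser at once. One simple way to keep the widget responsive is to display only a random sample of rows; the page linked in the warning describes server-side processing as the alternative. A minimal sketch, assuming a made-up data frame `big` in place of `hm_data` and an arbitrary sample size of 1,000 rows:

```r
library(dplyr)
library(DT)

set.seed(1)

# Hypothetical stand-in for hm_data
big <- tibble(
  wid   = seq_len(50000),
  count = sample(1:40, 50000, replace = TRUE)
)

# Show a random subset instead of every row
preview <- slice_sample(big, n = 1000)
widget <- datatable(preview, options = list(pageLength = 10))
widget
```

The sample size is a trade-off: a smaller preview loads faster but shows less of the data.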